IS : 8900 - 1978

Indian Standard
CRITERIA FOR THE REJECTION OF OUTLYING OBSERVATIONS
0. FOREWORD
0.1 This Indian Standard was adopted by the Indian Standards Institution on 25 July 1978, after the draft finalized by the Quality Control and Industrial Statistics Sectional Committee had been approved by the Executive Committee.

0.2 An outlying observation or an 'outlier' is one that appears to deviate markedly from the other observations of the sample in which it occurs. An outlier may arise merely because of an extreme manifestation of the random variability inherent in the data or because of non-random errors, such as gross deviation from the prescribed experimental procedure, mistakes in calculations, errors in recording numerical values, other human errors, loss of the calibration of an instrument, change of measuring instruments, etc. If it is known that a mistake has occurred, the outlying observation must be rejected irrespective of its magnitude. If, however, only a suspicion exists, it may be desirable to determine whether such an observation may be rejected or whether it may be accepted as part of the normal variation expected.

0.3 The procedure consists in testing the statistical significance of the outlier(s). A null hypothesis ( assumption ) is made that all the observations including the suspect observations come from the same population ( or lot ) as the other observations in the sample. A statistical test is then applied to determine whether this null hypothesis can be rejected at the specified level of significance ( see 2.8 ). If so, the outliers can then be taken to have come from a population(s) different from that of the other observations in the sample and hence the outlier(s) can be rejected.

0.3.1 This standard provides certain statistical criteria which would be useful for the identification of the outlier(s) on an objective basis. When so identified, necessary investigations may also be initiated wherever possible to find out the assignable causes which gave rise to the outlier(s).

0.3.2 The other objectives of the statistical tests for locating the outlier may be to:
a) screen the data before statistical analysis;
b) sound an alarm that outliers are present, thereby indicating the need for a closer study of the data generating process; and
c) pin-point the observations which may be of special interest just because they are extremes.
0.4 When a test method recommends more than one determination for reporting the average value of the characteristic under consideration, the precision ( repeatability and reproducibility ) of the test method is found to be useful in detecting observations which deviate unduly from the rest. For further guidance in this regard, reference may be made to IS : 5420 ( Part I )-1969*. When the precision of the test method is not known quantitatively, one or more of the procedures covered in the standard may be found helpful. The tests outlined in this standard primarily apply to the observations in a single random sample or the experimental data as given by the replicate measures of some property of a given material.

*Guide on precision of test methods: Part I Principles and applications.

0.5 Although a number of statistical tests based on various considerations are available to screen the given data for outliers, only those tests which are simple and efficient have been included in this standard.

0.5.1 In the case of a single outlier ( the smallest or the largest suspect observation ), two tests are available. One test is based on the standard deviation whereas the other test is based on the ratio of differences between certain order statistics, that is, the observations when they are arranged in the ascending or descending order of magnitude. The latter test would be more useful in those cases where the calculation of standard deviation is to be avoided or a quick judgement is called for.

0.5.2 In the case of two or more outliers at either end of the ordered sample observations, the test given is based on the ratio of the sample sum of squares when the doubtful observations are omitted to the sample sum of squares when the doubtful observations are included, the sample sum of squares being defined as the sum of the squares of the deviations of the observations from the corresponding mean. If simplicity in calculation is the prime requirement then the test based on the order statistics for the case of single outliers may be used by actually omitting one suspect observation in the sample at a time. However, this test is to be applied with caution because the overall level of significance of the test may change due to its repeated applications.

0.5.3 In the case of two or more outliers such that one is at least at each end of the ordered sample observations, two tests have been given. One test is based on the ratio of the range to the standard deviation whereas the other test is based on the largest residuals, a residual being defined as the deviation of an observation from the corresponding mean. It may be mentioned that the former test is applicable when only two observations, namely, the largest and the smallest observations, are suspect whereas the latter test is much more general.

0.6 In a given set of data, whenever a large number of observations ( say more than 25 percent ) are found to be outlying, it may be desirable to discard the entire data. However, this guideline is to be applied with considerable caution in those situations where the data consists of a few observations only.

0.7 Almost all the statistical criteria for the identification of the outliers as given in this standard are based on the assumption that the underlying distribution of the observations is normal. Hence, it is important that these criteria are not used indiscriminately. In case the assumption of normality is in doubt, it is advisable to obtain the guidance of a competent statistician to ascertain the feasibility of the applicability of these test criteria.

0.8 In reporting the result of a test or analysis, if the final value, observed or calculated, is to be rounded off, it shall be done in accordance with IS : 2-1960*.

1. SCOPE

1.1 This standard lays down the criteria for the detection of outlying observations in the following three situations:
a) Single outlier ( at either end of the ordered sample observations ),
b) Two or more outliers ( at either end of the ordered sample observations ), and
c) Two or more outliers ( one at least at each of the two ends of the ordered sample observations ).

1.2 In situations other than those enumerated in 1.1, it may be more appropriate to conduct test for non-normality of observations which are to be covered in a separate standard.

2. TERMINOLOGY

2.0 For the purpose of this standard, the following definitions shall apply.

2.1 Sample - Collection of items from a lot.

2.2 Sample Size ( n ) - Number of items in a sample.

2.3 Mean ( Arithmetic ) ( $\bar{x}$ ) - Sum of the values of the observations divided by the number of observations.

2.4 Standard Deviation ( s ) - The square root of the quotient obtained by dividing the sum of the squares of deviations of the observations from their mean by one less than the number of observations in the sample.

2.5 Variance - Square of the standard deviation.

*Rules for rounding off numerical values ( revised ).

2.6 Range - The difference between the largest and the smallest observations in a sample.

2.7 Null Hypothesis - Hypothesis ( or assumption ) that all the observations in the sample come from the same parent population, distribution or lot.

2.8 Level of Significance - The probability ( or risk ) of rejecting the null hypothesis when it is true. Conventionally it is taken to be 5 or 1 percent.

2.9 Statistic - A function calculated from the observations in the sample.

2.10 Critical Value - The value of the appropriate statistic which would be exceeded by chance with some small probability equivalent to the level of significance chosen.

2.11 Degrees of Freedom - The number of independent component values which are necessary to determine a statistic.

3. TESTS FOR SINGLE OUTLIER

3.0 There are many situations when, in a given sample of size n, one of the observations, which is either the largest or the smallest, is suspect. To detect such outliers, two tests have been given. The first one needs the calculation of the average and the standard deviation of the sample observations whereas the second one is based on order statistics and hence is easier to operate.

3.1 Let $x_1, x_2, \ldots, x_n$ be the n sample observations arranged in the ascending order of magnitude so that $x_1 \le x_2 \le \ldots \le x_n$. If the smallest observation $x_1$ is suspect, the test criterion

$$T_1 = \frac{\bar{x} - x_1}{s}$$

may be used, where the mean ( $\bar{x}$ ) and the standard deviation ( s ) of the observations are given as:

$$\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$s = \sqrt{\frac{\sum_{i=1}^{n} ( x_i - \bar{x} )^2}{n - 1}}$$

If the calculated value of $T_1$ is greater than or equal to the corresponding tabulated ( critical ) value of $T_1$ in Table 1 for sample size n and the chosen level of significance, $x_1$ shall be treated as an outlier and rejected; otherwise no doubt would be cast on the validity of $x_1$.
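The criterion of 3.1, and its counterpart for the largest observation in 3.1.1, can be sketched in Python as below. This is only an illustration: the function name `single_outlier_statistic` and its `suspect` argument are assumptions made for this sketch and are not part of the standard. The data are the tensile strength results of Example 1 ( see 3.1.2 ), and the critical value 2.176 is the Table 1 entry quoted in that example for sample size 10 at the 5 percent level.

```python
import math


def single_outlier_statistic(observations, suspect="largest"):
    """Return (T, mean, s) for the suspect end of the sample (3.1 and 3.1.1)."""
    n = len(observations)
    if n < 3:
        raise ValueError("Table 1 starts at sample size 3")
    mean = sum(observations) / n
    # Standard deviation with divisor n - 1, as defined in 2.4
    s = math.sqrt(sum((x - mean) ** 2 for x in observations) / (n - 1))
    if s == 0:
        raise ValueError("all observations are equal; the criterion is undefined")
    if suspect == "largest":
        t = (max(observations) - mean) / s  # T_n
    elif suspect == "smallest":
        t = (mean - min(observations)) / s  # T_1
    else:
        raise ValueError("suspect must be 'largest' or 'smallest'")
    return t, mean, s


# Tensile strength (MPa) of the 10 brass rod specimens of Example 1
tensile = [368, 370, 370, 370, 372, 372, 372, 380, 384, 397]
t_n, mean, s = single_outlier_statistic(tensile, "largest")

critical_5_percent = 2.176  # Table 1, sample size 10, 5 percent level
reject = t_n >= critical_5_percent
print(f"mean={mean:.1f} s={s:.2f} T_n={t_n:.3f} reject={reject}")
```

The critical value is passed in by hand rather than looked up, so the sketch does not depend on a transcription of Table 1.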
3.1.1 If the largest observation $x_n$ is doubtful, then the criterion

$$T_n = \frac{x_n - \bar{x}}{s}$$

shall be calculated. If the calculated value of $T_n$ is greater than or equal to the corresponding critical value in Table 1 for sample size n and the specified level of significance, $x_n$ would be rejected as an outlier; not otherwise.

3.1.2 Example 1 - The tensile strengths ( in megapascals, MPa ) of 10 specimens of brass rods of 10 mm diameter for general engineering purposes, drawn from a lot, were obtained as 368, 370, 370, 370, 372, 372, 372, 380, 384 and 397. The doubtful observation is the highest value $x_{10}$ = 397 MPa. The mean ( $\bar{x}$ ) and the standard deviation ( s ) for the 10 results are obtained as:

$\bar{x}$ = 375.5 MPa, and s = 9.06 MPa.

Hence $T_{10} = \frac{397 - 375.5}{9.06} = 2.373$

Since the calculated value ( 2.373 ) of $T_{10}$ is greater than the tabulated value 2.176 of $T_n$ in Table 1 for the sample size 10 and 5 percent level of significance, the largest value 397 may be treated as an outlier and rejected.

3.2 The second criterion for the detection of a single outlier is based on the ratios of differences between relevant ordered observations. If $x_1, x_2, \ldots, x_n$ are the ordered set of observations ( arranged in the ascending order of magnitude ) then depending upon the total number of observations in the sample, certain ratios of differences shall be calculated as indicated in Table 2. If these ratios exceed the corresponding critical values at the chosen level of significance, the existence of an outlier is indicated; not otherwise. The ratio to be calculated is indicated as $r_{10}$ when the sample size is 7 or less. For sample sizes from 8 to 10, the ratio to be calculated is $r_{11}$. For sample sizes from 11 to 13, the relevant ratio for computation is $r_{21}$ whereas for samples from 14 to 25, the ratio $r_{22}$ is to be computed. ( The connotation of r's when the smallest value or the largest value is suspect is indicated in Table 2. )

3.2.1 Example 2 - If the second criterion ( see 3.2 ) is applied to the data of Example 1 to find out whether the largest observation is an outlier or not, then the ratio to be used for a sample of size 10 is:

$$r_{11} = \frac{x_n - x_{n-1}}{x_n - x_2} = \frac{397 - 384}{397 - 370} = \frac{13}{27} = 0.481$$

Since the calculated value ( 0.481 ) of $r_{11}$ is greater than the tabulated value 0.477 in Table 2 for sample size 10 and 5 percent level of significance, the largest value 397 may be treated as an outlier. This inference is the same as drawn on the basis of the first criterion in Example 1.
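The choice of ratio by sample size described in 3.2 can be sketched as follows. The function name `order_statistic_ratio` is hypothetical, and the ratio definitions are taken in their usual order-statistic form ( the same form that Example 2 uses for $r_{11}$ ); where Table 2 is not legible this should be read as an assumption rather than as the text of the standard.

```python
def order_statistic_ratio(observations, suspect="largest"):
    """Return (name, ratio) of the order-statistic criterion chosen by sample size."""
    x = sorted(observations)
    n = len(x)
    if not 3 <= n <= 25:
        raise ValueError("the criterion is tabulated for sample sizes 3 to 25 only")

    # i: how far the numerator reaches in from the suspect end
    # j: how many values are dropped from the opposite end in the denominator
    if n <= 7:
        name, i, j = "r10", 1, 0
    elif n <= 10:
        name, i, j = "r11", 1, 1
    elif n <= 13:
        name, i, j = "r21", 2, 1
    else:
        name, i, j = "r22", 2, 2

    if suspect == "largest":
        # Mirror the sample so that the suspect value always sits at index 0
        x = [-v for v in reversed(x)]
    elif suspect != "smallest":
        raise ValueError("suspect must be 'largest' or 'smallest'")

    denominator = x[n - 1 - j] - x[0]
    if denominator == 0:
        raise ValueError("denominator is zero; the ratio is undefined")
    return name, (x[i] - x[0]) / denominator


# Data of Example 1, largest value suspect (Example 2)
tensile = [368, 370, 370, 370, 372, 372, 372, 380, 384, 397]
name, ratio = order_statistic_ratio(tensile, "largest")
exceeds = ratio > 0.477  # critical value quoted in Example 2 (n = 10, 5 percent)
print(name, round(ratio, 3), exceeds)
```

Mirroring the sample for the largest-value case lets one indexing rule serve both ends, which keeps the four ratio definitions in a single place.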
4.0 When two or more observations at the lower end or the upper end of the ordered set of results deviate unduly from the rest, one way of detecting the outliers is by the repeated application of the tests recommended in 3 for single outlier. Although this procedure is simple, it has the disadvantage that the overall level of significance of the test would not be the same as in the case of test for a single outlier. Hence this test should be applied with caution. However, separate tests have also been developed when more than one observation at either end of the ordered set of observations are suspected to be outliers.

4.1 Let $x_1, x_2, \ldots, x_n$ be the n observations in the sample arranged according to the ascending order of magnitude. If it is desired to test whether the k largest observations are outliers, the procedure as given below may be followed:

The mean ( $\bar{x}$ ) and the sum of the squares ( $S^2$ ) of the deviations of the observations from their mean are calculated as follows:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$S^2 = \sum_{i=1}^{n} ( x_i - \bar{x} )^2 = ( n - 1 ) s^2$$

Leaving out the k suspect observations, the mean ( $\bar{x}_{n-k}$ ) and the sum of squares ( $S^2_{n-k}$ ) of the remaining ( n - k ) observations are calculated as:

$$\bar{x}_{n-k} = \frac{x_1 + x_2 + \ldots + x_{n-k}}{n - k} = \frac{1}{n-k} \sum_{i=1}^{n-k} x_i$$

$$S^2_{n-k} = \sum_{i=1}^{n-k} ( x_i - \bar{x}_{n-k} )^2$$

The ratio $L_k = \frac{S^2_{n-k}}{S^2}$, which is the test criterion, is then computed.

The calculated value of $L_k$ is compared with the corresponding critical value given in Table 3 for the sample size n and the chosen level of significance. If the calculated value of $L_k$ is found to be less than the corresponding tabulated value, then the k largest observations will be considered as outliers, not otherwise. It may be noted that for the criterion $L_k$ and Table 3, the smaller values denote significance and not the larger values as in the case of the earlier criteria and Tables 1 and 2.
4.1.1 In case the k smallest observations happen to be suspect, the procedure followed would be exactly similar to that given in 4.1.
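A minimal sketch of the criterion of 4.1 and 4.1.1 is given below. The names `sum_of_squares`, `l_k` and the `tail` argument are illustrative assumptions. The data and the critical value 0.337 are those quoted in Example 3 ( see 4.1.2 ). Because the sketch does not round the intermediate sums of squares, it gives a ratio of about 0.404 rather than the 0.405 obtained in the example from rounded values; the inference is unchanged.

```python
def sum_of_squares(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def l_k(observations, k, tail="largest"):
    """Ratio of the sum of squares without the k suspect values to that with them."""
    x = sorted(observations)
    n = len(x)
    if not 0 < k < n - 1:
        raise ValueError("k must leave at least two observations in the sample")
    if tail == "largest":
        retained = x[: n - k]  # drop the k largest observations (4.1)
    elif tail == "smallest":
        retained = x[k:]       # drop the k smallest observations (4.1.1)
    else:
        raise ValueError("tail must be 'largest' or 'smallest'")
    total = sum_of_squares(x)
    if total == 0:
        raise ValueError("all observations are equal; the ratio is undefined")
    return sum_of_squares(retained) / total


# Silicon dioxide content (percent) of the 13 bauxite splits of Example 3
silica = [3.74, 3.76, 3.78, 3.78, 3.78, 3.84, 3.84,
          3.85, 3.89, 3.90, 3.90, 3.98, 4.01]
l_2 = l_k(silica, k=2, tail="largest")

critical_5_percent = 0.337  # Table 3 value quoted in Example 3 (n = 13, k = 2)
# For this criterion, SMALLER values are significant
outliers = l_2 < critical_5_percent
print(round(l_2, 3), outliers)
```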

4.1.2 Example 3 - In the course of preparation of a standard reference sample of bauxite, 13 splits were analyzed by a laboratory for the percentage content of silicon dioxide ( $SiO_2$ ). The test results were 3.74, 3.76, 3.78, 3.78, 3.78, 3.84, 3.84, 3.85, 3.89, 3.90, 3.90, 3.98 and 4.01. The two largest observations, namely, 3.98 and 4.01 are suspected to be outliers.

From all the 13 observations,

$\bar{x}_{13}$ = 3.85 and $S^2_{13}$ = 0.084

Leaving out the 2 suspect observations, from the remaining 11 results

$\bar{x}_{11}$ = 3.82 and $S^2_{11}$ = 0.034

Hence $L_2 = \frac{S^2_{11}}{S^2_{13}} = \frac{0.034}{0.084} = 0.405$

Since the calculated value ( 0.405 ) of $L_2$ is greater than the tabulated value 0.337 of $L_2$ in Table 3 for the sample size 13 and 5 percent level of significance, it is inferred that there is not enough evidence to suspect the two largest observations as outliers.

5. TESTS FOR TWO OR MORE OUTLIERS ( AT LEAST ONE OUTLIER AT EACH END )

5.0 When there are two or more suspect observations such that they occur at both ends of the ordered set of sample results, two statistical tests of significance for their detection have been given. The first test which is based on the ratio of the sample range to the sample standard deviation is applicable only when two outliers are present ( one each at the two ends of the ordered sample results ). The second test of significance is more broad-based and is applicable to all situations where two or more suspect observations occur in such a manner that at each end of the ordered sample results there is at least one suspect observation.

5.1 Test for Two Outliers ( One Outlier at Each of the Two Ends ) - Let $x_1, x_2, \ldots, x_n$ be the n sample observations arranged in the ascending order of magnitude. When both $x_1$ and $x_n$ are the suspect observations, the mean ( $\bar{x}$ ), the standard deviation ( s ) and the range ( R ) are first calculated, the range being given by $R = x_n - x_1$. The ratio R/s, which constitutes the test criterion, is then calculated and compared against the critical values given in Table 4 for the sample size n and the specified level of significance. If the calculated value of R/s is larger than the corresponding tabulated value, then both $x_1$ and $x_n$ are considered to be outliers, not otherwise.

5.1.1 Example 4 - The shearing strengths ( in kg ) of 15 specimens obtained from a batch of plywood tea-chest panels, when arranged according to ascending order of magnitude, were as follows:

87.5, 88.7, 92.9, 93.3, 93.6, 94.5, 94.7, 95.0, 95.2, 95.4, 96.1, 97.2, 98.3, 100.0, 105.7

If the smallest and the largest observations, namely, 87.5 and 105.7, are suspected to be outliers, then from all the 15 observations the range ( R ) and the standard deviation ( s ) are obtained as:

R = 105.7 - 87.5 = 18.2

$$s = \sqrt{\frac{1}{14} \sum_{i=1}^{15} ( x_i - \bar{x} )^2} = 4.32$$

Hence R/s = 18.2/4.32 = 4.21

Since the tabulated value of R/s in Table 4 corresponding to a sample size 15 and 5 percent level of significance is 4.17 ( which is less than the calculated value of 4.21 ), the smallest and largest values are considered as outliers.

5.2 Test for More Than Two Outliers ( At Least One Outlier at Each End ) - Let $x_1, x_2, \ldots, x_n$ be the sample observations arranged in the ascending order of magnitude. If two or more ( say k ) of these observations are suspect such that there is at least one suspect observation at each end of the ordered observations in the sample, the procedure as given below may be followed:

The mean ( $\bar{x}$ ) of the sample observations is first calculated. Then the residuals, that is, the absolute deviations of the observations from their mean are calculated as:

$$| x_1 - \bar{x} |, \; | x_2 - \bar{x} |, \; | x_3 - \bar{x} |, \; \ldots, \; | x_n - \bar{x} |$$

These n residuals are then arranged in the ascending order of magnitude and designated as $z_1, z_2, \ldots, z_n$. Thus $z_1$ corresponds to that sample observation x which is nearest to the mean $\bar{x}$ and $z_n$ to the sample observation x which is farthest from the mean $\bar{x}$. The mean ( $\bar{z}$ ) and the sum of squares ( $U^2$ ) of these z values are then calculated as follows:

$$\bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i$$

$$U^2 = \sum_{i=1}^{n} ( z_i - \bar{z} )^2$$

Leaving out those z values ( k in number ) which correspond to the k suspect observations of the original sample, the remaining ( n - k ) values are taken and their mean ( $\bar{z}_{n-k}$ ), that is, the mean of the ( n - k ) least extreme values, as also the sum of squares ( $U^2_{n-k}$ ) are calculated as:

$$\bar{z}_{n-k} = \frac{1}{n-k} \sum_{i=1}^{n-k} z_i$$

$$U^2_{n-k} = \sum_{i=1}^{n-k} ( z_i - \bar{z}_{n-k} )^2$$

The ratio $E_k = \frac{U^2_{n-k}}{U^2}$, which is the test criterion, is then compared with the critical value given in Table 5 for the sample size n, the value of k and the desired level of significance. If the calculated value of $E_k$ is found to be smaller than the corresponding tabulated value, then the k original suspect values are considered to be outliers, not otherwise. It may be noted that smaller values of $E_k$ denote significance as in the case of $L_k$ in Table 3.

5.2.1 Example 5 - In the data on shearing strength given in Example 4, if the smallest two observations 87.5 and 88.7 and the largest observation 105.7 are suspected to be outliers, then the mean ( $\bar{x}$ ) is first calculated as 95.2. The residuals, that is, the absolute deviations of the sample observations from their mean are then found out as 7.7, 6.5, 2.3, 1.9, 1.6, 0.7, 0.5, 0.2, 0, 0.2, 0.9, 2.0, 3.1, 4.8 and 10.5. If these residuals ( $z_i$ ) are arranged in the ascending order of magnitude, we get 0, 0.2, 0.2, 0.5, 0.7, 0.9, 1.6, 1.9, 2.0, 2.3, 3.1, 4.8, 6.5, 7.7 and 10.5. Now the 3 residuals corresponding to the 3 original suspected outliers in the new sample are the last three values, namely, 6.5, 7.7 and 10.5 which would be suspect and tested for their significance.

From all the 15 residuals we get the mean ( $\bar{z}_{15}$ ) and the sum of squares as:

$\bar{z}_{15}$ = 2.86

$$U^2_{15} = \sum_{i=1}^{15} ( z_i - \bar{z}_{15} )^2 = 138.84$$

Leaving out the 3 largest residuals ( z values ), we get from the remaining 12 values:

$\bar{z}_{12}$ = 1.52

$$U^2_{12} = \sum_{i=1}^{12} ( z_i - \bar{z}_{12} )^2 = 22.14$$

Hence $E_3 = \frac{U^2_{12}}{U^2_{15}} = \frac{22.14}{138.84} = 0.159$

Since the calculated value ( 0.159 ) of $E_3$ is less than 0.206 which is the tabulated value of $E_k$ in Table 5 for the sample size 15, k = 3 and 5 percent level of significance, all the three observations, namely, 87.5, 88.7 and 105.7, are to be considered as outliers.
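Both tests of clause 5 can be sketched in a few lines of Python. The function names `range_ratio` and `e_k` are assumptions made for illustration. The sketch also assumes that the k suspect observations are the ones with the k largest residuals, which is the situation in Example 5; it uses the unrounded mean, so intermediate figures differ slightly from the rounded ones in the examples. The critical values 4.17 and 0.206 are those quoted in Examples 4 and 5.

```python
import math


def sum_of_squares(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def range_ratio(observations):
    """R/s of 5.1: sample range divided by the sample standard deviation."""
    n = len(observations)
    s = math.sqrt(sum_of_squares(observations) / (n - 1))
    if s == 0:
        raise ValueError("all observations are equal; R/s is undefined")
    return (max(observations) - min(observations)) / s


def e_k(observations, k):
    """E_k of 5.2, taking the k largest residuals as the suspect values."""
    n = len(observations)
    if not 0 < k < n - 1:
        raise ValueError("k must leave at least two residuals")
    mean = sum(observations) / n
    # Residuals are absolute deviations from the mean, sorted ascending (z_1..z_n)
    z = sorted(abs(x - mean) for x in observations)
    total = sum_of_squares(z)
    if total == 0:
        raise ValueError("all residuals are equal; E_k is undefined")
    return sum_of_squares(z[: n - k]) / total


# Shearing strength (kg) of the 15 plywood specimens of Example 4
shear = [87.5, 88.7, 92.9, 93.3, 93.6, 94.5, 94.7, 95.0,
         95.2, 95.4, 96.1, 97.2, 98.3, 100.0, 105.7]

r_over_s = range_ratio(shear)
both_ends_outliers = r_over_s > 4.17   # Table 4 value quoted in Example 4

e_3 = e_k(shear, k=3)
three_outliers = e_3 < 0.206           # Table 5 value quoted in Example 5
print(round(r_over_s, 2), both_ends_outliers, round(e_3, 3), three_outliers)
```

As with $L_k$, the comparison for $E_k$ runs in the opposite direction to that for R/s: a small ratio is the significant one.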
TABLE 1 CRITICAL VALUES OF $T_1$ OR $T_n$ FOR TESTING AN OUTLIER
( Clause 3.1 )

SAMPLE SIZE, n     SIGNIFICANCE LEVEL
                   5 Percent     1 Percent
 3                 1.153         1.155
 4                 1.463         1.492
 5                 1.672         1.749
 6                 1.822         1.944
 7                 1.938         2.097
 8                 2.032         2.221
 9                 2.110         2.323
10                 2.176         2.410
11                 2.234         2.485
12                 2.285         2.550
13                 2.331         2.607
14                 2.371         2.659
15                 2.409         2.705
16                 2.443         2.747
17                 2.475         2.785
18                 2.504         2.821
19                 2.532         2.854
20                 2.557         2.884
21                 2.580         2.912
22                 2.603         2.939
23                 2.624         2.963
24                 2.644         2.987
25                 2.663         3.009
30                 2.745         3.103
35                 2.811         3.178
40                 2.866         3.240
45                 2.914         3.292
50                 2.956         3.336
TABLE 2 CRITICAL VALUES OF CRITERIA BASED ON ORDER STATISTICS FOR TESTING AN OUTLIER
( Clause 3.2 )

STATISTIC     WHEN THE OUTLIER IS
              Smallest Value                         Largest Value
$r_{10}$      $( x_2 - x_1 ) / ( x_n - x_1 )$        $( x_n - x_{n-1} ) / ( x_n - x_1 )$
$r_{11}$      $( x_2 - x_1 ) / ( x_{n-1} - x_1 )$    $( x_n - x_{n-1} ) / ( x_n - x_2 )$
$r_{21}$      $( x_3 - x_1 ) / ( x_{n-1} - x_1 )$    $( x_n - x_{n-2} ) / ( x_n - x_2 )$
$r_{22}$      $( x_3 - x_1 ) / ( x_{n-2} - x_1 )$    $( x_n - x_{n-2} ) / ( x_n - x_3 )$

STATISTIC     SAMPLE SIZE, n     SIGNIFICANCE LEVEL
                                 5 Percent     1 Percent
$r_{10}$       3                 0.941         0.988
               4                 0.765         0.889
               5                 0.642         0.780
               6                 0.560         0.698
               7                 0.507         0.637
$r_{11}$       8                 0.554         0.683
               9                 0.512         0.635
              10                 0.477         0.597
$r_{21}$      11                 0.576         0.679
              12                 0.546         0.642
              13                 0.521         0.615
$r_{22}$      14                 0.546         0.641
              15                 0.525         0.616
              16                 0.507         0.595
              17                 0.490         0.577
              18                 0.475         0.561
              19                 0.462         0.547
              20                 0.450         0.535
              21                 0.440         0.524
              22                 0.430         0.514
              23                 0.421         0.505
              24                 0.413         0.497
              25                 0.406         0.489
TABLE 3 CRITICAL VALUES OF $L_k$ FOR THE DETECTION OF k ( LARGEST OR SMALLEST ) OUTLIERS
( Clause 4.1 )

SAMPLE SIZE, n     k = 1 to 10, each at the 5 Percent and 1 Percent SIGNIFICANCE LEVEL
-4 : 7 8 9 10

-: 1.018
PO01 C PO55 CPI06 PO00 PO04 PO21 0.010 bO47 0.032 PO76 0.064 I.112 0.099 t.142 0.129 l.178 l.208 I.233 l.267 J.294 I.311 l.338 l.358 I.366 P387 P488 l.526 I.574 1.608 l.636 I.668 0.162 0,196 0.224 0.250 0.276 0.300 0.322 0.331 0.354 0377 0.450 0506 0.554 0.588 0618 0.646

-

-_

-

-_

_----_ -

--

c PO10
0.008 tiola 0032 0052 0.070 0.094 0.113 0.132 0.151 0.171 0.192 k201 0.231 0.374 0.434 @482 0,523 0556 0588 0.308 0.369 0.418 0460 0.498 0,531 -_ (PO34 0.012 I.054 : PO76 PO98 : l.122 (PI40 pi59 : P181 (l.200 I.209 : P238 I.312 i J.376 (I.424 (J.468 P502 : I.535 O-026 0.038 0.056 0.072 0.090 pi08 0.126 @140 I 0154 t 0.175
IPO42 I9.060 I0.079

E C
(I.270 j.305 : j.337 I.363 t I.387 (P410 I.427 i 1.447 l.462 l.484

0022 0.045 C1.070 0.070 I.098 E I.120 L-147 : I.172 I I.194 l.219 :.b.237 ( k260 (. I.272 c P300 0.098 0.125 @I50 0.174 PI97

IPO97 I0.115 IPl54

3.019 3.033 DO42 >057 I.072 PO91 3.104 >I18 1.136 3204 1.36a 3.321 Y364 1.399 1438

I.050 PO66 YO82 )*I00 PllG 1.130 Pl50 I.222 I.283 l.334 I.378 I.417 P450

PO27 PO37 >049 PO64 )_076 1.088 1.104 l.168 I.229 P282 l.324 I.361 I*400

I3.136
I3.168 I3.188 IP262

D*O55 I3,030
0.072 0086 0.099 w115 0.184 0245 0.297 0.342 0.382 0.414
I3044
IDO53 ID-064

I 1.550 I l.599 (I.642
40 :;

IPO78 I0144

3.062 I.074 PO88 3'154 I.212 J'264 I.310 ; I.350 3'383

I.056 PO46 I (PO58
(1166

JO66 0.126 0183 0235 0.280 0.320 0.356

:b434

P377

I

(X.72 l.696 I.722

( k484 j.522 l.558 c I.592

' 0246 ' 0.312
0.364 0408 0444 0483
1

I3,327
I3376 I0.42

I

1 I3.456 I3.490

I3*250 I0.292 I`3.328

IP196

(PI12

;I.262

Y220

1 2

368

(I.296 1 1.336
TABLE 4 CRITICAL VALUES OF R/s FOR THE DETECTION OF TWO OUTLIERS ( ONE AT EACH END OF ORDERED RESULTS )
( Clauses 5.1 and 5.1.1 )

SAMPLE SIZE, n     SIGNIFICANCE LEVEL
                   5 Percent     1 Percent
 3                 2.00          2.00
 4                 2.43          2.45
 5                 2.75          2.80
 6                 3.01          3.10
 7                 3.22          3.34
 8                 3.40          3.54
 9                 3.55          3.72
10                 3.68          3.88
11                 3.80          4.01
12                 3.91          4.13
13                 4.00          4.24
14                 4.09          4.34
15                 4.17          4.43
16                 4.24          4.51
17                 4.31          4.59
18                 4.38          4.66
19                 4.43          4.73
20                 4.49          4.79
30                 4.89          5.25
40                 5.15          5.54
50                 5.35          5.77

TABLE 5 CRITICAL VALUES OF $E_k$ FOR THE DETECTION OF k ( SOME SMALL AND OTHERS LARGE ) OUTLIERS
( Clauses 5.2 and 5.2.1 )
0.009 0.013

0.004